Abstract

Procedures for establishing standards and determining the number of
items needed in criterion-referenced measures are reviewed. The
discussion of setting a passing score is organized around five factors:
performance of others, item content, educational consequences,
psychological and financial costs, and measurement error. Classical
test theory, binomial, and sequential models for determining test
length are considered. An illustrative table relating test length,
proficiency standard, and required accuracy is provided.

In recent years there has been much attention given to criterion-
referenced measures which relate test performance to absolute standards
rather than to the performance of others. Popham and Husek (1969) provide
a readable account of the differences between such measures and the more
traditional norm-referenced tests. The purpose of this paper is to
synthesize much of the literature on establishing standards and
determining the number of items needed in criterion-referenced measures.

This paper is written from the following perspective. A domain
(i.e., population) of dichotomously scorable test "items" is conceptualized.
This population of items need not actually exist. What is important, though,
is that it is described well enough so that a relatively high degree of
agreement can be reached about which kinds of items are or are not members
of the population. In practice, only a reasonably representative sample of
items is required.²



The items of a domain may be heterogeneous in content, form and 
difficulty. In practice, however, they should be measures of a limited number 
of skills and knowledges so that it makes sense to establish a single profi- 
ciency standard. 

Of interest is the proportion of such items a student can pass. It 
is assumed that some educational decision, e.g., the nature of subsequent 
instruction for the student, is conditional upon whether or not he exceeds 
a proficiency standard when administered a sample of items from the domain. 
Thus, attention is directed toward the individual examinee and his perfor- 
mance relative to the standard rather than toward producing indicators of 
group performance. 

PASSING SCORE

"The establishment of standards of achievement . . . is exceedingly
complex and subjective (it) is a task not to be attempted lightly."
(Science Research Associates, 1966, p. 16) The frequent practice of
employing a particular passing score (say, 65%) only on the grounds
of tradition is difficult to defend, in part because it seems unreasonable
to require the same level of proficiency for all domains and all individuals
and in part because there are other sources besides tradition which should
be considered when determining a standard. The discussion below of several
such sources and practices does not conclude with a single recommended
method for calculating a passing score; rather, the intention is to
alert the reader to information and procedures which should be considered
when a standard is being established.

None of the procedures eliminates the need for judgment. The focus
of this rational thought shifts when each of the following five sources of
information is utilized.

Performance of Others 

One procedure which has a degree of rationality is to set the passing
score such that a predetermined percent of students pass. Test construction
suggestions in this situation have been provided (see, e.g., Tinkelman, 1971).
Whether an individual passes under this scheme depends, in part, on the general
competence of the others taking the test. This procedure is most applicable
when the number of people who can or should be given some treatment or
"certification" is fixed and the assessment task is to select the ablest
examinees.

Item Content

One source to turn to when deciding upon a passing score is the test
items themselves. Each item in the test can be inspected and a judgment
made about how important it is that it be answered correctly. Such a
procedure is the following:



Suppose that a standard (not to be
confused with "standard scores") of
satisfactory performance is to be
established for the twelfth grade.
The first step would be to study
with care each of the individual
problems ... and to decide, in terms
of one's own subjective notions of
the adult needs of the typical
high school graduate, how many of
these problems the typical beginning
senior should be able to solve. In
other words, one would have to decide
subjectively what raw score on this
test the typical senior ought to
make or exceed. (Science Research
Associates, 1966, p. 16)



A similar suggestion, with a probabilistic variant, has been offered
by Angoff (1971).



...ask each judge to state the
probability that the "minimally
acceptable person" would answer
each item correctly. In effect,
the judges would think of a number
of minimally acceptable persons,
instead of only one such person,
and would estimate the proportion
of minimally acceptable persons
who would answer each item correctly.
The sum of these probabilities, or
proportions, would then represent the
minimally acceptable score. (p. 515)
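
The probabilistic variant quoted above can be illustrated with a minimal sketch. The quotation specifies only that the probabilities are summed over items; averaging the resulting sums across several judges is an assumption made here for illustration, and the ratings shown are invented.

```python
from statistics import mean


def angoff_passing_score(ratings):
    """Return a passing score from judges' item-by-item probability ratings.

    ratings[j][i] is judge j's estimate of the probability that a minimally
    acceptable person answers item i correctly. Each judge's ratings are
    summed over items; averaging across judges is an assumption of this
    sketch, not part of the quoted suggestion.
    """
    per_judge_sums = [sum(judge) for judge in ratings]
    return mean(per_judge_sums)


# Hypothetical ratings from two judges on a four-item test.
ratings = [
    [0.9, 0.6, 0.5, 0.8],
    [0.8, 0.7, 0.4, 0.9],
]
print(angoff_passing_score(ratings))
```

With a single judge the function reduces to the simple sum described in the quotation.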

Alternatively, a decision might be made that to pass the test
an acceptable answer must be given to all the items in one group, that
some fraction of the items in a second group must be answered correctly,
and that only a smaller fraction of the remaining items need be answered
in an acceptable way. A plan somewhat similar to this, but which leads
to a single passing score, is offered by Ebel (in press).

A different procedure for converting judgments about the
acceptability of each option of each item to a passing score and letter
grades was described within a classroom context almost 20 years ago by
Leo Nedelsky (1954). A procedure similar to Nedelsky's is being used
in certifying candidates as eligible for the degree of Doctor of Medicine
at the University of Illinois College of Medicine (Crawford, 1970). A
"Minimum Pass Level" for multiple-choice items is constructed as follows.
Each option of each item is scrutinized and the options (including the
keyed one) which a barely passing student might experience difficulty
in discriminating are identified. Let o_j be the number of such options
in item j and O_j be the total number of options in item j. The required
passing score is then equal to the fraction o_j/O_j summed over all items.
A student's score is merely the total number of items he answers
correctly.
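
The computation just described can be sketched as follows. This is only an illustration of the sum as it is stated in the text; the item data are hypothetical, and the function name is not taken from Crawford (1970).

```python
def minimum_pass_level(items):
    """Sum the fraction o_j / O_j over all items.

    items is a list of (o_j, O_j) pairs, where o_j counts the options
    (keyed option included) that a barely passing student might have
    difficulty discriminating and O_j is the total number of options.
    """
    level = 0.0
    for hard_options, total_options in items:
        if not 1 <= hard_options <= total_options:
            raise ValueError("need 1 <= o_j <= O_j for every item")
        level += hard_options / total_options
    return level


# Hypothetical four-item test: (o_j, O_j) for each item.
items = [(2, 4), (1, 5), (3, 4), (2, 5)]
print(minimum_pass_level(items))  # 0.5 + 0.2 + 0.75 + 0.4
```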



A weakness of this system is that it gives a premium to an
examinee for knowing (not guessing) the keyed option to, say, n items.
That premium is an opportunity to choose, for each of n other items,
a foil which a barely passing student should not experience difficulty
in eliminating and to still be above the passing level. In fields like
medicine, the system permits a student who knows the correct answer to
n items to select "dangerous" options to another n items and still not
fail the test. One wonders why all tough-to-discriminate options are not
treated as correct options and a passing score determined in a manner similar
to those suggested in the opening paragraphs of this subsection.

Educational Consequences



A primary consideration in setting a required level of profi- 
ciency is its effect on future learning. If the level is set too low, 
students will be given instruction on new concepts and skills which they 
cannot handle effectively. If the level is too stringent, efficiency is 
reduced as students spend needless time on remedial materials. Very 
helpful would be any data on what happens when students with differing 
degrees of proficiency in a given knowledge domain are subjected to the 
alternative instructional sequences. 

Procedures for determining passing scores when test and criteria
data are both available are provided by Bormuth (1971). Although Bormuth's
research involved concurrently available criteria, the regression procedures
are generalizable to the situation in which performance data are acquired
after the alternative educational decisions have been implemented.

In absence of data about educational consequences, the following 
guideline can be offered. If, on the basis of a logical analysis of the 
subject matter and the extant instructional system, the knowledges and 
skills are seen as fundamental or prerequisite to future learning, then 
a high proficiency level should be required. A lower passing score can 
be tolerated when the material is not seen as completing a necessary link 
in the development of some more complex concept or skill, especially if 
the ideas will be covered again in the curriculum. Application of solely 
this guideline would probably result in higher passing scores for tests 
on "basic" topics in mathematics than for tests covering "units" in social 
studies or English grammar. Tests of performances not viewed as prerequisite 
for future learning probably should not have passing scores and not be 
criterion-referenced. (Garvin, 1970)

Psychological and Financial Costs

All things being equal, a low passing score should be used
when the psychological and financial costs associated with a remedial
instructional program are relatively high. That is, there should be
fewer failures when the costs of failing are high. These "costs" might
include lower motivation and boredom, damage to self-concept, and dollar
and time expenses of conducting a remedial instructional program. A
higher passing score can be tolerated when the above costs are not too
great or when the negative effects of moving a student too rapidly
through a curriculum (i.e., confusion, inefficient learning, etc.)
are seen as very important to avoid.

Emerick (1971) and Kriewall (1969) have reported procedures
which utilize, at least indirectly, the ratio of the two kinds of costs
in arriving at a passing score. Unfortunately, Emerick's model employs
some very restrictive assumptions which make it inapplicable to the
situation being considered here and of limited value. Specifically,
the domain of items is viewed as "highly homogeneous in terms of
content, form, and difficulty level." (p. 322) Further, "an examinee can
occupy only one status with respect to the skill being tested: mastery
or nonmastery." (p. 322) Thus, when a student misses an item it is
assumed to be the result of measurement error rather than partial
knowledge. We agree with Ebel (1971) that "abilities, understandings, and
appreciations are in the experience of almost everyone, not all-or-none
adaptations. They are matters of degree. None but the simplest of them
can ever be mastered completely by anyone." (p. 287)

Kriewall's model is applicable when the student has partial
knowledge. It does not employ the restrictive and impractical equal item
difficulty assumption, although Kriewall himself and Besel (1971) appear to
argue otherwise. Kriewall's model is similar to Emerick's in that given
numerical values of the degree to which certain costs or errors of classifying
students will be tolerated and the length of the test, a passing score can be
computed which minimizes the costs (errors). A comparison of the Emerick and
Kriewall approaches is provided by Besel (1971).
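
The general idea of choosing the passing score that minimizes weighted classification errors can be illustrated with a minimal sketch. It is not Emerick's or Kriewall's actual procedure. It assumes binomial sampling of items, two hypothetical reference proficiency levels (one for students who should fail, one for students who should pass), and hypothetical cost weights supplied by the user.

```python
from math import comb


def prob_at_least(cut, n_items, p):
    """Binomial probability of answering at least `cut` of n_items correctly."""
    return sum(comb(n_items, k) * p**k * (1 - p) ** (n_items - k)
               for k in range(cut, n_items + 1))


def best_cut(n_items, p_low, p_high, cost_false_pass, cost_false_fail):
    """Return (passing score, expected weighted loss) with the smallest loss.

    p_low and p_high are assumed proficiency levels of students who should
    fail and should pass. A cut of n_items + 1 means nobody can pass.
    """
    best = None
    for cut in range(n_items + 2):
        false_pass = prob_at_least(cut, n_items, p_low)
        false_fail = 1 - prob_at_least(cut, n_items, p_high)
        loss = cost_false_pass * false_pass + cost_false_fail * false_fail
        if best is None or loss < best[1]:
            best = (cut, loss)
    return best


# Eight items; students at 70% should fail and students at 90% should pass.
print(best_cut(8, 0.7, 0.9, 1, 1)[0])  # equal costs
print(best_cut(8, 0.7, 0.9, 1, 3)[0])  # failing a competent student costs more
print(best_cut(8, 0.7, 0.9, 3, 1)[0])  # passing a weak student costs more
```

In this illustration the selected passing score falls as the relative cost of a false failure rises, which is the direction argued for in the text.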

Measurement Error 



There is a systematic error introduced in estimating an examinee's 
proficiency when the test item format allows a student to answer items 
correctly by guessing. The passing score could be raised to take into 
account the expected contribution attributed to pure guessing. Alter- 
natively, each student's score could be adjusted according to the standard 
correction-for-guessing formula and this adjusted score compared to the 
standard. Since pure, random guessing occurs rarely, adjusting either
the examinee's score or the standard, as described above, will be expected 
to control the guessing contribution only partially. (Emerick's procedure 
mentioned above also takes into consideration a guessing factor; Kriewall's 
does not.) 
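
The two adjustments described above can be sketched as follows, assuming multiple-choice items that all have the same number of options and purely random guessing on unknown items. The function names are illustrative only.

```python
def corrected_score(right, wrong, options):
    """Standard correction for guessing: rights minus wrongs / (options - 1).

    Omitted items are counted as neither right nor wrong.
    """
    return right - wrong / (options - 1)


def raised_standard(standard, options):
    """Expected proportion correct for a student who knows `standard` of the
    domain and guesses at random on the remainder."""
    return standard + (1 - standard) / options


# A 40-item, four-option test: 34 right and 6 wrong.
print(corrected_score(34, 6, 4))   # 32.0, i.e. 80% of 40 items
# An 80% standard raised to allow for random guessing among four options.
print(raised_standard(0.80, 4))
```

Either the examinee's score is lowered before it is compared with the standard, or the standard itself is raised; only one of the two adjustments should be applied.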



An additional error in the measuring process is expected when,
for reasons of difficulty of construction, inconvenience of administration,
or ignorance, the variety of types of questions and content represented in
the domain is not used in the test. When the test items are thus suspected
to be unrepresentative, it is well to raise or lower the standard an additional
amount in order to protect against whichever misclassification error (examinee
passes when he should fail, examinee fails when he should pass) is feared the
more.

Even a student's corrected-for-guessing percent score on a random
sample of test items will usually not equal the true proportion of all the
items in the domain to which he "really" knows the answer. This expected
random measurement error can be reduced by using more test items. The
relation of test length and proficiency standard to the accuracy in classifying
students will be dealt with in the next part of this paper.

TEST LENGTH 



Recall that this paper is written from the perspective that a
domain of dichotomously scorable test "items" is conceptualized and that
an estimate of the proportion of such items an examinee can answer correctly
is desired.



Rather than sample problem solving 
behavior across a hypothetical popu- 
lation of pupils, it is more appropri- 
ate to measure the individual's behavior 
on a random sample of problems drawn 
from a clearly defined population of 
problems. The individual's relative 
score on this sample can then be 
interpreted as an estimate of his 
proficiency relative to that class 
of problems. (Kriewall, 1969, p. 37)



In this context, the test length problem is determining the size of such
a sample of problems needed to acquire an estimate having a specified
level of accuracy.



Classical Test Theory

The classical test theory approach to determining accuracy and
test length makes use of the standard error of measurement. It follows
from the assumptions of the model that the standard error of measurement
is constant for all true scores, a condition that probably is not true in
practice. Further, in order to convert numerical values of standard errors
into probability statements dealing with score accuracies, it is necessary
to know or make assumptions about the error distributions. The usual normal
distribution assumption is most vulnerable in those situations often found
with domain-referenced tests; namely, when the number of items used to
measure a particular proficiency is small and when the performance standard
approaches the ceiling of the test. Finally, the value of a standard error
for a test depends upon the group of examinees to whom the test was
administered, and this is in conflict with the context of the problem as
described above.
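
The constancy of the classical standard error of measurement can be contrasted with an item-sampling error in a minimal sketch. The classical formula used is the usual one (score standard deviation times the square root of one minus the reliability). The binomial standard error of a number-correct score is included only for comparison and is an assumption of this illustration, as are the numerical values.

```python
from math import sqrt


def standard_error_of_measurement(score_sd, reliability):
    """Classical SEM: a single value for the test, whatever the true score.

    Both arguments are group statistics, so the result depends on the
    particular group of examinees tested.
    """
    return score_sd * sqrt(1.0 - reliability)


def binomial_standard_error(n_items, true_proportion):
    """Standard error of a number-correct score under random item sampling."""
    return sqrt(n_items * true_proportion * (1.0 - true_proportion))


print(standard_error_of_measurement(2.0, 0.75))  # 1.0 for every examinee
print(binomial_standard_error(10, 0.5))          # largest near 50%
print(binomial_standard_error(10, 0.9))          # smaller near the ceiling
```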

Although excellent in many respects, the Livingston (1972) approach
to the reliability of criterion-referenced tests does not help overcome these
problems. The standard error of a test is the same regardless of whether the
classical or criterion-referenced reliability coefficient is used (Harris,
1972). Remaining are the limitations of the standard error of measurement
in determining the accuracy of a test score and, in turn, the required test
length.

Binomial Model



If one assumes that the proficiency test represents a randomly
selected set of 0-1 scored items from some domain of tasks, and if one
further assumes that the experience of taking the earlier items on the
test does not influence the examinee's chances of passing the later items,
then an exact solution to the number-of-items-needed problem is given by
the binomial distribution (for infinite or very large domains) or by the
hypergeometric distribution (for relatively small item domains). No
assumption regarding item homogeneity (in content or difficulty) is needed.
(See Lord and Novick, 1968, section 11.9, for distribution statistics
associated with this model.)
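
The two distributions named above can be compared in a minimal sketch. The domain size of 20 items and the examinee who can pass 14 of them are hypothetical values chosen so that the proportion known (70%) is the same in both calculations.

```python
from math import comb


def binomial_pass_probability(cut, n_items, p):
    """P(at least `cut` correct) when items come from a very large domain."""
    return sum(comb(n_items, k) * p**k * (1 - p) ** (n_items - k)
               for k in range(cut, n_items + 1))


def hypergeometric_pass_probability(cut, n_items, domain_size, known):
    """P(at least `cut` correct) when n_items are drawn without replacement
    from a finite domain of which the examinee can pass `known` items."""
    total = comb(domain_size, n_items)
    return sum(comb(known, k) * comb(domain_size - known, n_items - k)
               for k in range(cut, n_items + 1)) / total


# Passing score of 4 on a 5-item test for an examinee who knows 70%.
print(binomial_pass_probability(4, 5, 0.7))
print(hypergeometric_pass_probability(4, 5, 20, 14))
```

The hypergeometric value approaches the binomial value as the domain grows while the proportion known is held fixed.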

No group measures or item indices computed over examinees are
utilized in this binomial conceptualization. Rather, the items which an
examinee can pass and those the individual fails are analogous to two
colors of balls in an urn. Continuing the analogy, the test length question
is, How many balls must be sampled (items administered) so that the percent
of all balls in the urn of a given color (test items in the domain answered
correctly) can be estimated accurately? The urns associated with other
examinees are of no concern.

Tables relating test length to accuracy for a given passing score
have been constructed by Millman (1972)³ using this model. Table 1 displays
the relevant data when an 80% passing score is selected.

Table 1 about here

To illustrate the use of Table 1, suppose that an educator is willing
to tolerate a 25% misclassification error (i.e., classify as "pass" a student
who does not know 80% of all the items or vice versa) for those students who
actually know 70% or 90% of the items. Reading down the 70% and 90% columns,
note that roughly 25% errors (actually 26% and 19%) will occur if a random
sample of eight items from the domain is used and a passing score of seven
imposed.
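
The two percentages in this illustration follow directly from the binomial model and can be reproduced with a minimal sketch. The function names are illustrative, and the 80% standard is passed as a default argument.

```python
from math import comb


def pass_probability(passing_score, n_items, p):
    """Binomial probability of reaching the passing score on n_items."""
    return sum(comb(n_items, k) * p**k * (1 - p) ** (n_items - k)
               for k in range(passing_score, n_items + 1))


def misclassification_percent(passing_score, n_items, true_percent, standard=80):
    """Percent of students at a given true level who are misclassified.

    Students below the standard are misclassified when they pass; students
    above it are misclassified when they fail.
    """
    passed = pass_probability(passing_score, n_items, true_percent / 100)
    wrong = passed if true_percent < standard else 1 - passed
    return round(100 * wrong)


# Passing score of seven on eight items, true levels of 70% and 90%.
print(misclassification_percent(7, 8, 70), misclassification_percent(7, 8, 90))  # 26 19
```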

Sequential Models

Sequential testing procedures in which the number of items ultimately
given to a student depends upon the closeness of his performance relative to
a passing percentage have been suggested (Kriewall, 1969; Ferguson, 1970).
These schemes are based upon earlier work by Wald (1947). Sequential testing
can also be conducted within a Bayesian framework. Examples of how this might
be done are provided by Powers (1971).

The primary advantage of employing such models is that fewer
test items, on the average, are needed to acquire a given overall level of
accuracy. Such procedures are most feasible when examinees interact with
computers during testing. When paper and pencil tests are used, it would
appear more efficient to administer to all students a somewhat more generous,
but equal, number of test items.
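
A sequential scheme of the kind based on Wald's work can be sketched with the sequential probability ratio test. This is a generic illustration and not the specific procedure of Kriewall or Ferguson. The two reference proficiency levels (70% and 90%) and the error rates (5% each) are hypothetical values chosen for the example.

```python
from math import log


def sprt_mastery(responses, p_fail=0.7, p_pass=0.9, alpha=0.05, beta=0.05):
    """Wald's sequential probability ratio test on 0-1 item responses.

    alpha is the tolerated chance of passing a student at p_fail, and beta
    the tolerated chance of failing a student at p_pass. Returns the
    decision and the number of items used to reach it.
    """
    upper = log((1 - beta) / alpha)   # cross this bound: classify as pass
    lower = log(beta / (1 - alpha))   # cross this bound: classify as fail
    log_ratio = 0.0
    for count, correct in enumerate(responses, start=1):
        if correct:
            log_ratio += log(p_pass / p_fail)
        else:
            log_ratio += log((1 - p_pass) / (1 - p_fail))
        if log_ratio >= upper:
            return "pass", count
        if log_ratio <= lower:
            return "fail", count
    return "undecided", len(responses)


print(sprt_mastery([1] * 20))    # an unbroken run of correct answers
print(sprt_mastery([0] * 5))     # an unbroken run of errors
print(sprt_mastery([1, 0, 1]))   # too few items to decide
```

Testing stops as soon as either bound is crossed, so students far from the standard receive few items while those near it receive more.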

2

In contrast to "criterion", the term "domain" is reserved for the case
when an item generation procedure is employed or a universe is postulated
and the items used are considered to be a representative sample. Cronbach
et al. (1963) used the notion of a universe of test items in a theory of
reliability. Osburn (1968) seems to have made the first in-print
statement of the precise definition of domain.

One set of procedures for generating such a domain makes use of item
forms. A thorough description of the techniques and examples may be
found in Maxwell et al. (1970), with further descriptions and extensions
in Mussell (1969) and Rabehl (1970). Other item generating schemes have
been proposed by Anderson (undated), Bormuth (1970), Guttman (see, e.g.,
Jordan, 1970), and to some extent by Scandura (see, e.g., Durnin and
Scandura, 1970).

A less formal way of conceptualizing a domain of items is to list the
specific instructional objectives included within the domain in a
manner such that it becomes evident what items will be included in
and excluded from the domain. This latter procedure was followed in
the development of the revised collections of objectives in the basic
skill areas published by the Instructional Objectives Exchange, P.O.
Box 24095, Los Angeles, 90024.

3

Also considered in the Millman reference is the mathematically parallel
problem of determining the number of examinees needed to measure accurately
the proportion of all students able to answer a given item correctly.

TABLE 1

Student Assessment
Minimum Passing Percent of 80

PERCENT OF STUDENTS EXPECTED TO BE MISCLASSIFIED

Passing     No. of                STUDENT'S TRUE LEVEL-OF-FUNCTIONING*
Score       Test Items     40    50    60    70    75  :    85    90    95
--------------------------------------------------------------------------
    1 out of 1             40    50    60    70    75  :    15    10     5
    2 out of 2             16    25    36    49    56  :    28    19    10
    3 out of 3              6    13    22    34    42  :    39    27    14
    4 out of 4              3     6    13    24    32  :    48    34    19
    4 out of 5              9    19    34    53    63  :    16     8     2
    5 out of 6              4    11    23    42    53  :    22    11     3
    6 out of 7              2     6    16    33    44  :    28    15     4
    7 out of 8              1     4    11    26    37  :    34    19     6
    8 out of 9              -     2     7    20    30  :    40    23     7
    8 out of 10             1     5    17    38    53  :    18     7     1
   10 out of 12             -     2     8    25    39  :    26    11     2
   12 out of 15             -     2     9    30    46  :    18     6     1
   16 out of 20             -     1     5    24    41  :    17     4     -
   20 out of 25             -     -     3    19    38  :    16     3     -
   24 out of 30             -     -     2    16    35  :    15     3     -
   32 out of 40             -     -     1    11    30  :    14     2     -
   40 out of 50             -     -     -     8    26  :    12     1     -
   48 out of 60             -     -     -     6    23  :    11     1     -
   60 out of 75             -     -     -     4    19  :     9     -     -
   80 out of 100            -     -     -     2    15  :     7     -     -
--------------------------------------------------------------------------
*The true level-of-functioning is the percent of items a student would
be able to answer correctly if he were given the entire universe of items.

Students having true level-of-functioning values less than
the minimum passing percent of 80 should fail a test composed of items
from this universe. However, on any given test of finite length, some
of these students will get at least 80% of the items correct and be considered
as "passers". The expected percent of such misclassifications is given
in the body of the table to the left of the dotted line.

Students having true level-of-functioning values greater than
the passing percent of 80 should pass such a test. The percent of these
students who will be misclassified as "failures" is shown in the table
to the right of the dotted line.